Continuous-time reinforcement learning: ellipticity enables model-free value function approximation
We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted $q$-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.
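The abstract's core computational loop, fitting value functions by iteratively solving least-squares regression problems, is the generic fitted Q-iteration pattern; a minimal sketch of that pattern with a linear function class is given below. The feature map, ridge penalty, and all names are illustrative assumptions and do not reproduce the paper's Sobolev-prox algorithm or its continuous-time discretization.

```python
import numpy as np

# Minimal sketch of fitted Q-iteration on a fixed batch of transitions (s, a, r, s').
# Only the iterative least-squares loop is illustrated; the paper's Sobolev-prox
# regularization and continuous-time discretization are omitted.

def fitted_q_iteration(s, a, r, s_next, n_actions, gamma=0.99, n_iters=50, ridge=1e-3):
    n, dim = s.shape
    w = np.zeros(dim * n_actions)          # weights of a linear Q-function

    def features(states, actions):
        # One-hot action block of state features (an illustrative feature map).
        phi = np.zeros((len(states), dim * n_actions))
        for i, (x, b) in enumerate(zip(states, actions)):
            phi[i, b * dim:(b + 1) * dim] = x
        return phi

    X = features(s, a)                     # design matrix for the taken actions
    for _ in range(n_iters):
        # Regression targets: r + gamma * max_a' Q(s', a') under the current weights.
        q_next = np.stack(
            [features(s_next, np.full(n, b)) @ w for b in range(n_actions)], axis=1
        )
        y = r + gamma * q_next.max(axis=1)
        # One ridge least-squares solve per iteration.
        w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
    return w
```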
Direct Advantage Estimation
The predominant approach in reinforcement learning is to assign credit to actions based on the expected return. However, we show that the return may depend on the policy in a way that could lead to excessive variance in value estimation and slow down learning. Instead, we show that the advantage function can be interpreted as a causal effect and shares similar properties with causal representations. Based on this insight, we propose Direct Advantage Estimation (DAE), a novel method that models the advantage function and estimates it directly from on-policy data while simultaneously minimizing the variance of the return, without requiring the (action-)value function. We also relate our method to Temporal Difference methods by showing how value functions can be seamlessly integrated into DAE. The proposed method is easy to implement and can be readily adopted by modern actor-critic methods. We evaluate DAE empirically on three discrete control domains and show that it outperforms generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization.
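As a rough illustration of estimating the advantage directly, the sketch below parametrizes an advantage head, centers it under the current policy so it has zero mean in every state (matching the causal-effect reading in the abstract), and regresses the observed return onto a state-value baseline plus the discounted sum of centered advantages. This is one plausible reading of the abstract, not DAE's exact objective; all module and argument names are made up.

```python
import torch
import torch.nn as nn

class DAESketch(nn.Module):
    """Illustrative policy-centered advantage model (not the paper's exact loss)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions))
        self.v = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def advantage(self, obs, policy_probs):
        raw = self.f(obs)
        # Center under the policy so the advantage has zero mean in every state.
        return raw - (policy_probs * raw).sum(dim=-1, keepdim=True)

def dae_loss(model, obs, actions, policy_probs, total_return, gamma=0.99):
    # Regress the trajectory return onto V(s_0) plus the discounted sum of the
    # centered advantages of the taken actions; minimizing this squared residual
    # reduces the variance attributed to the return.
    adv = model.advantage(obs, policy_probs)                     # (T, n_actions)
    adv_taken = adv.gather(1, actions.view(-1, 1)).squeeze(1)    # (T,)
    discounts = gamma ** torch.arange(len(obs), dtype=obs.dtype)
    pred = model.v(obs[:1]).squeeze() + (discounts * adv_taken).sum()
    return (total_return - pred) ** 2
```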
Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering
Qiao, Zhongjian, Yang, Rui, Lyu, Jiafei, Bai, Chenjia, Li, Xiu, Yang, Zhuoran, Gao, Siyang, Qiu, Shuang
Cross-domain offline reinforcement learning aims to train an agent for deployment in the target environment, leveraging both a limited target-domain dataset and a source-domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domains, simply merging the two datasets may yield inferior performance. Recent advances address this issue by selectively sharing source-domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook value alignment, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap for a policy trained on the source domain and evaluated on the target domain. Motivated by these theoretical insights, we propose to selectively share source-domain samples with both high dynamics and high value alignment and present our Dynamics- and Value-aligned Data Filtering (DVDF) method. We design a range of dynamics-shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, including challenging, extremely low-data settings where the target-domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms strong prior baselines and delivers exceptional performance across multiple tasks and datasets.
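A high-level sketch of the kind of selective sharing the abstract describes: each source-domain transition receives a dynamics-alignment score (here, the prediction error of a target-domain dynamics model) and a value-alignment score (here, a target critic's estimate), and only transitions passing both thresholds are merged into training. The scoring functions, thresholds, and batch layout are illustrative placeholders, not the DVDF criteria.

```python
import numpy as np

def filter_source_data(source_batch, target_dynamics_model, target_critic,
                       dyn_quantile=0.5, val_quantile=0.5):
    """Illustrative dynamics- and value-aware filtering of source-domain transitions.

    `target_dynamics_model(s, a)` is assumed to predict next states under the target
    dynamics, and `target_critic(s, a)` to score how valuable a transition is for the
    target task; both are placeholders standing in for whatever DVDF actually uses.
    """
    s, a, s_next = source_batch["obs"], source_batch["act"], source_batch["next_obs"]

    # Dynamics alignment: small prediction error means the transition is plausible
    # under the target dynamics.
    dyn_error = np.linalg.norm(target_dynamics_model(s, a) - s_next, axis=-1)
    dyn_ok = dyn_error <= np.quantile(dyn_error, dyn_quantile)

    # Value alignment: keep transitions the target critic considers high-value.
    value_score = target_critic(s, a)
    val_ok = value_score >= np.quantile(value_score, 1.0 - val_quantile)

    keep = dyn_ok & val_ok
    return {k: v[keep] for k, v in source_batch.items()}
```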
Model-Based Learning of Whittle Indices
Charles-Rebuffé, Joël, Gast, Nicolas, Gaujal, Bruno
We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating, and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of an existing state-of-the-art algorithm. We provide a proof of convergence to the Whittle indices we want to learn, as well as a bound on the time needed to learn them to arbitrary precision. Moreover, we investigate the algorithm's computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in the number of samples needed to reach an accurate approximation. In addition, its total computational cost is even lower than that of Q-learning for any reasonably large number of samples. These observations persist even when the Q-learning algorithms are sped up using pre-trained neural networks to predict Q-values.
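The pipeline in the abstract, estimating an empirical MDP from counts and then computing Whittle indices on the estimate, can be illustrated with a textbook-style routine: for each state, binary-search for the passive-action subsidy at which the active and passive actions are equally valuable in the subsidized single-arm problem. This assumes indexability and a two-action arm, and it is not the extended state-of-the-art algorithm BLINQ builds on.

```python
import numpy as np

def empirical_mdp(transitions, n_states, n_actions):
    """Empirical transition and reward estimates from (s, a, r, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        rewards[s, a] += r
        visits[s, a] += 1
    visits = np.maximum(visits, 1)
    return counts / visits[:, :, None], rewards / visits

def subsidized_values(P, R, subsidy, gamma=0.95, iters=500):
    """Value iteration for a two-action arm where the passive action (a = 0) receives a subsidy."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        q_passive = R[:, 0] + subsidy + gamma * P[:, 0] @ V
        q_active = R[:, 1] + gamma * P[:, 1] @ V
        V = np.maximum(q_passive, q_active)
    # Recompute the action values at the (near-)fixed point.
    q_passive = R[:, 0] + subsidy + gamma * P[:, 0] @ V
    q_active = R[:, 1] + gamma * P[:, 1] @ V
    return q_passive, q_active

def whittle_indices(P, R, lo=-10.0, hi=10.0, tol=1e-4):
    """Per-state Whittle index via binary search on the subsidy (assumes indexability)."""
    indices = np.zeros(P.shape[0])
    for s in range(P.shape[0]):
        a, b = lo, hi
        while b - a > tol:
            m = 0.5 * (a + b)
            q_passive, q_active = subsidized_values(P, R, m)
            if q_active[s] > q_passive[s]:
                a = m          # active still preferred: raise the subsidy
            else:
                b = m
        indices[s] = 0.5 * (a + b)
    return indices
```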
HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning
Liu, Zejiao, Tu, Junqi, Hong, Yitian, Xiong, Luolin, Jin, Yaochu, Tang, Yang, Li, Fangfei
In cooperative Multi-Agent Reinforcement Learning (MARL), efficient exploration is crucial for optimizing the performance of the joint policy. However, existing methods often update joint policies via independent per-agent exploration, without coordination among agents, which inherently constrains the expressive capacity and exploration of joint policies. To address this issue, we propose a conductor-based joint policy framework that directly enhances the expressive capacity of joint policies and coordinates exploration. In addition, we develop a Hierarchical Conductor-based Policy Optimization (HCPO) algorithm that directs policy updates for the conductor and the agents in a direction aligned with performance improvement. A rigorous theoretical guarantee further establishes the monotonicity of the joint policy optimization process. By deploying local conductors, HCPO retains the benefits of centralized training while eliminating inter-agent communication during execution. Finally, we evaluate HCPO on three challenging benchmarks: the StarCraft II Multi-Agent Challenge, Multi-Agent MuJoCo, and the Multi-Agent Particle Environment. The results indicate that HCPO outperforms competitive MARL baselines in cooperative efficiency and stability.
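To make the conductor idea concrete, the sketch below shows one way a conductor module could emit a coordination signal that conditions each agent's local policy, with a per-agent local conductor at execution time so no inter-agent communication is required. The module names, sizes, and wiring are illustrative assumptions, not the HCPO architecture.

```python
import torch
import torch.nn as nn

class Conductor(nn.Module):
    """Illustrative conductor: maps a local observation to a coordination signal."""
    def __init__(self, obs_dim, signal_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, signal_dim))

    def forward(self, obs):
        return self.net(obs)

class ConditionedAgentPolicy(nn.Module):
    """Agent policy conditioned on its own observation and the conductor's signal."""
    def __init__(self, obs_dim, signal_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + signal_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs, signal):
        logits = self.net(torch.cat([obs, signal], dim=-1))
        return torch.distributions.Categorical(logits=logits)

def act(agents, conductors, observations):
    # Execution-time sketch: each agent queries its own local conductor, so no
    # inter-agent communication is needed, mirroring the abstract's claim.
    actions = []
    for policy, conductor, obs in zip(agents, conductors, observations):
        signal = conductor(obs)
        actions.append(policy(obs, signal).sample())
    return actions
```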